get_effective_cell for getting the contents of Excel cell when the cell is merged #4673

Closed
wants to merge 2 commits into from

Conversation

cancan101
Contributor

Closes #4672

@jreback
Contributor

jreback commented Aug 26, 2013

is this just a useful method? (e.g. you are not using it anywhere)

tests?

@cancan101
Contributor Author

@jreback cancan101@e82bfa4 Has tests for the new method.

@jreback
Contributor

jreback commented Aug 26, 2013

@cancan101 oh..ok

can you elaborate on the utility of this though?

it SEEMS useful...but what is the actual usecase?

@cancan101
Contributor Author

Sure. Plenty of Excel documents have several columns that share one merged header cell, so a given header cell applies to multiple columns (think MultiIndex). If I want to read the header value for a column whose header cell might be merged, I would use this new method.

The classic example that I see is in SEC filings, where you end up with:

3 months ended:             6 months ended:
1/1/2012    1/1/2011        1/1/2012    1/1/2011
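
A minimal sketch of the lookup described above; this is not the code from this PR. It assumes merged ranges arrive as `(row_lo, row_hi, col_lo, col_hi)` tuples with exclusive upper bounds (the convention xlrd uses for `sheet.merged_cells`) and that only the top-left cell of a merged range stores a value. The name `effective_cell_value` and the list-of-lists `grid` are hypothetical stand-ins for a real worksheet.

```python
def effective_cell_value(grid, merged_ranges, row, col):
    """Return the value that visually applies to the cell at (row, col).

    If the cell lies inside a merged range, return the value stored in that
    range's top-left cell; otherwise return the cell's own value.
    """
    for row_lo, row_hi, col_lo, col_hi in merged_ranges:
        if row_lo <= row < row_hi and col_lo <= col < col_hi:
            return grid[row_lo][col_lo]
    return grid[row][col]


# Header rows from the example above; merged cells are empty except top-left.
grid = [
    ["3 months ended:", "", "6 months ended:", ""],
    ["1/1/2012", "1/1/2011", "1/1/2012", "1/1/2011"],
]
# Assumed merged ranges: row 0, columns 0-1 and columns 2-3.
merged_ranges = [(0, 1, 0, 2), (0, 1, 2, 4)]

headers = [effective_cell_value(grid, merged_ranges, 0, c) for c in range(4)]
```

With this input, every column in the top row resolves to the label of the merged cell covering it, while cells outside any merged range return their own value.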

@cancan101
Contributor Author

@jreback The idea being that a given cell acts as a header for multiple columns

@hayd
Contributor

hayd commented Aug 26, 2013

Can you add a docstring to the method to explain this.

To me what you're saying seems a bit magical, so an example in the docstring would be good. Also, why the name "effective" cell? Is this standard?

@cancan101
Contributor Author

@hayd I can definitely write a docstring. If you have something better than effective, I can certainly change to that. effective sounded reasonable to me.

@hayd
Contributor

hayd commented Aug 26, 2013

Not sure tbh, but there may be an Excel (or more general) term for this. If so we should use it (I'm still not 100% sure what this is, so I don't think I'm the best person to google it to see if the term exists :) ).

@jtratner
Contributor

I'd like to close this in favor of a more general way of handling merged cells [at least at the top of columns]. E.g. if you had something like this in the header:

_________________________________
|Foo        |  Bar               |
|________________________________|
| A | B | C | D | E | F | G | H  |
|________________________________|

This would become a set of columns that are a MultiIndex (and the equivalent for rows). The API would just be that you pass in a list/integer for the number of rows to consider the "header", or a list/integer for the number of columns to consider the "index", and pandas would intelligently handle it from there [so the sum total of the API change would be, potentially, two additional keyword arguments]. Whatever we decide as the convention here could also be used for converting DataFrames with a MultiIndex into Excel files.
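
A rough sketch of the header handling proposed here, under stated assumptions rather than as pandas' actual behaviour: a merged cell is assumed to show up as its label followed by blank cells, so forward-filling each outer header row left to right recovers the per-column labels. The function name `header_rows_to_tuples` is hypothetical, and a genuinely empty header cell would be indistinguishable from a merged one in this sketch.

```python
def header_rows_to_tuples(header_rows, blank=""):
    """Turn header rows with merged (blank-padded) cells into column tuples.

    Every row except the last is forward-filled left to right; the last row
    is taken as-is. Rows of unequal width are rejected immediately.
    """
    widths = {len(row) for row in header_rows}
    if len(widths) != 1:
        raise ValueError("header rows must all have the same number of cells")

    filled = []
    for row in header_rows[:-1]:
        current = blank
        filled_row = []
        for cell in row:
            if cell != blank:
                current = cell  # a new merged cell starts here
            filled_row.append(current)
        filled.append(filled_row)
    filled.append(list(header_rows[-1]))  # innermost level is used unchanged
    return list(zip(*filled))


# The Foo/Bar header from the diagram above: Foo spans A-C, Bar spans D-H.
header_rows = [
    ["Foo", "", "", "Bar", "", "", "", ""],
    ["A", "B", "C", "D", "E", "F", "G", "H"],
]
columns = header_rows_to_tuples(header_rows)
```

The resulting list of tuples is in the shape that `pandas.MultiIndex.from_tuples` accepts, so it could serve as the `columns` of a DataFrame.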

If any of this already exists, then that's great.

@jtratner
Contributor

and pandas might use something like this function to determine if there are merged cells and, if so, how they should be handled.

@cancan101
Contributor Author

@jtratner I like your suggestion of converting the headings into a MultiIndex. Whatever solution we come up with for parsing Excel files should also work for HTML tables, which I have seen use the same format.

FWIW, I did find this forum about the issue for Excel files: http://answers.microsoft.com/en-us/office/forum/office_2007-excel/unmerge-cells-and-copy-the-content-in-each/49f46676-e318-4d33-8cac-7c6302214534

@jreback
Contributor

jreback commented Aug 26, 2013

http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex

This already exists for csv, basically header=[0,1]
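
For illustration, a small sketch of what `header=[0,1]` does with `read_csv`. The CSV text is made up for this example, and it assumes the outer header label is already repeated in every column it covers, since a CSV file has no notion of merged cells.

```python
import io

import pandas as pd

# Illustrative data only: the top-level label is repeated for each column.
csv_text = (
    "3 months ended,3 months ended,6 months ended,6 months ended\n"
    "1/1/2012,1/1/2011,1/1/2012,1/1/2011\n"
    "10,20,30,40\n"
    "11,21,31,41\n"
)

# Treat the first two lines as a two-level column header.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Columns are now addressed by (outer, inner) tuples.
six_month_2011 = df[("6 months ended", "1/1/2011")]
```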

@cancan101
Contributor Author

@jreback What about HTML?

@jreback
Contributor

jreback commented Aug 26, 2013

I think this is quite tricky to do for HTML (I am not sure that read_html accepts a header option at all). That said, if you could specify a way to do it, creating the mi is not hard, but why wouldn't you do it afterwards anyhow? HTML is not very regular.

@cancan101
Contributor Author

The HTML that I am thinking about is regular enough. The rows all have the same number of "columns" and the html uses colspan where the Excel uses merged cells. See for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI
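
A sketch of the colspan/merged-cell equivalence, using only the standard library's `html.parser`; it is not how `read_html` works. The class name `ColspanExpander` and the sample markup are hypothetical, and the sketch deliberately ignores `rowspan` and nested tables.

```python
from html.parser import HTMLParser


class ColspanExpander(HTMLParser):
    """Collect table rows, repeating each cell's text across its colspan."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._span = None  # colspan of the cell being read, or None outside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            # Repeat the cell text once per spanned column.
            self._row.extend(["".join(self._text).strip()] * self._span)
            self._span = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """
<table>
  <tr><th colspan="2">3 months ended:</th><th colspan="2">6 months ended:</th></tr>
  <tr><th>1/1/2012</th><th>1/1/2011</th><th>1/1/2012</th><th>1/1/2011</th></tr>
</table>
"""
parser = ColspanExpander()
parser.feed(html)
rows = parser.rows
```

After expansion both header rows have four cells, which is the same repeated form a CSV header would use.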

@jreback
Contributor

jreback commented Aug 26, 2013

I take it back, I forgot that read_html takes a header argument. So it would be reasonable for it to deal with a list I suppose (to return a mi for the columns).

@cancan101
Contributor Author

@jreback what is an "mi" ?

@jreback
Contributor

jreback commented Aug 26, 2013

multi-index

@jtratner
Contributor

closing in favor of #4682 / #4679

@jtratner jtratner closed this Aug 26, 2013
@jtratner
Contributor

@cancan101 yeah, I agree, it's very regular and basically equivalent [i.e., here's a set of cells, some of them have widths and heights]. If you convert to repeating, it's exactly the same as what csv reader would do with them.

@jtratner
Contributor

@cancan101 and if it can't make sense of it [i.e., it doesn't seem regular], it could just fail quickly and make the user munge the data themselves.
